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Abstract. In this paper we show that a dynamical description of the protein folding process 
provides an effective representation of equilibrium properties and it allows for a direct inves- 
tigation of the mechanisms ruling the approach towards the native configuration. The results 
reported in this paper have been obtained for a two-dimensional toy-model of amino acid se- 
quences, whose native configurations were previously determined by Monte Carlo techniques. 
The somewhat controversial scenario emerging from the comparison among various thermo- 
dynamical indicators is definitely better resolved relying upon a truly dynamical description, 
that points out the crucial role played by long-range interactions in determining the charac- 
teristic step-wise evolution of "good" folders to their native state. It is worth stressing that 
this dynamical scenario is consistent with the information obtained by exploring the energy 
landscapes of different sequences. This suggests that even the identification of more efficient 
"static" indicators should take into account the peculiar features associated with the complex 
"orography" of the landscape. 
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To Linus Torvalds 

What can be thought, can be simulated. 



1. Introduction 

Proteins are heteropolymer chains made of aminoacids. The aminoacid se- 
quence (the so-called primary structure) determines the native configuration 
(tertiary structure) which, in turn, is responsible for the biological activity 
of the protein. The identification of the native structure corresponding to a 
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given aminoacid sequence and, viceversa, of the sequence yielding a given 
configuration are called direct and inverse problem, respectively. In spite of 
the increasing efforts made by the researchers working in this area, both prob- 
lems remain generally unsolved. A few different strategies have been adopted 
so far by the scientific community to tackle the protein-folding problem. The 
first method that has been developed could be called "black-box" approach, 
since one tries to infer the tertiary structure with no other knowledge than the 
configurations corresponding to some specific aminoacid sequences (e.g., the 
neural-network approach). Although this method has been implemented with 
some success, the lack of information about the physics of the underlying 
folding process does not allow going beyond statistical predictions. In order 
to overcome such difficulties, simplified Hamiltonians have been introduced 
with the goal of identifying the native structure through the implementation 
of equilibrium-statistical-mechanics tools (e.g. Monte Carlo techniques). The 
main difficulty of this approach arised from the huge number of relative min- 
ima, which makes the search for the absolute minimum rather questionable 
in realistic cases. 

However, it is known that, in spite of the very many accessible configura- 
tions, the protein folding turns out to be rather fast, actually, much faster than 
a pure random search (once the appropriate time scales are taken into account) 
(Creighton, 1993). It is, therefore, rather tempting to tackle the problem from 
a pure dynamical point of view, following, e.g., the evolution of "coiled" 
configurations towards globular-folded structures. An "ab initio" approach, 
where all molecular forces acting among the protein elements and between 
protein and solvent are taken into account, should, in principle, reveal all 
details of the folding dynamics. Unfortunately, even if the degrees of freedom 
of the solvent are traced out from the interaction Hamiltonian, the charac- 
teristic times associated with the microscopic dynamics are on the order of 
0(10"^^) seconds, while the folding process is expected to occur typically 
on time scales in between 0(1O~^) and 0(1) seconds. Simulating systems 
with thousands of degrees of freedom over time scales that cover ten orders 
of magnitude is definitely out of reach for the actual computing facilities and 
it will remain as such at least in the near future. 

On the other hand, it appears reasonable to conjecture that the fine details 
regarding the interaction structure and the degrees of freedom correspond- 
ing to the inner dynamics of the aminoacids do not matter for the folding 
process. Therefore, one can employ "coarse grained" potentials, epitomizing 
only a few relevant interactions. The price payed for such a drastic reduction 
of the gigantic complexity of the molecular structure of a protein should be 
hopefully compensated by the possibility of obtaining a reliable description 
of the folding process (provided the main ingredients ruling such a process 
have been correctly identified). 
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In fact, it seems that evolution has selected proteins out of all possible 
aminoacid sequences in such a way that their native states are stable and ki- 
netically accessible, so that only those sequences satisfying both requirements 
are biologically active. In fact, a great deal of papers has been devoted to the 
attempt of identifying "bad" and "good" folder sequences, relying upon their 
structural or equilibrium properties (Camacho and Thirumalai, 1993; Klimov 
and Thirumalai, 1996; Shakhnovic, 1994; Sali, Shakhnovic, and Karplus, 
1994; Irback and Potthast, 1995; Irback et al., 1997). 

With reference to a 2D off-lattice model, in this paper we show that strictly 
dynamical simulations can provide a full acount of heteropolymer properties. 
In particular, equilibrium simulations allow for an effective identification 
of the lowest minima of the energy landscape. Moreover, the comparison 
between folding and unfolding simulations shed some light on the glassy 
transition, while the peak of the specific heat is clearly resolved to locate the 
collapse transition. Throughout the paper we compare the behaviour and the 
properties of five sequences, suitably selected to investigate the differences 
between possible "good" and "bad" folders. 

More specifically, in section II we introduce the mesoscopic 2D off-lattice 
model. Equilibrium thermodynamic properties of the five selected sequences 
are discussed in section m (looking both at standard observables such as the 
total energy and the average distance between configurations). The confor- 
mation of the native valley and the associated energy funnel are investigated 
in section IV, while section V is devoted to the description of the dynamical 
evolution. Concluding remarks are reported in Sec. VI. 



2. The Model 

We will consider a shght generahzation of the 2-dimensional off-lattice model 
recently introduced by Stillinger et al. (Stillinger et al., 1993) and similar to 
that one previously studied by lori et al. (lori, Marinari and Parisi, 1991). 
Such a model is characterized by L point-like monomers (mimicking the 
residues of a heteropolymer) arranged along a one dimensional chain. The 
nature of the residues is assumed for simplicity to be of two types only: 
hydrophobic (H) or polar (P). Thus, each heteropolymer is unambiguously 
identified by a sequence of binary variables {^j} (with i = 1, . . . , L) along the 
backbone, where = 1 if the ith residue is of type H and = — 1, otherwise. 
The intramolecular potential is composed of three terms for each monomer: a 
nearest-neighbour harmonic interaction {Vi), a three-body interaction (V2) to 
simulate the energy cost of local bending, and a Lennard- Jones-like (LJ) in- 
teraction (V3) acting between pairs (i, j) of non-neighbouring residues. This 
last term depends on the nature of the residues, i.e. on both and ^j, in such 
a way to mimic the interaction with the solvent. 
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The Hamiltonian of the system writes as 

i=l i=l i=2 1=1 j=i+2 

(1) 

where the mass of each monomer is assumed to be unitary, {px,i,Py,i) = 



{xi.i/i), and j = ^{xi — XjY + (y^ — VjY. The first potential term ap- 
pearing in Eq. (1) is 

Vi{ri^i->ri) = a{ri^i+i - rof (2) 

with a = 20 and tq = 1; the second term, favouring the chain alignment, 
reads 

vm = (3) 

where 

{xi - Xi^i){xi+i - Xi) + {vi - yi^i){yi+x - Vi) 

cos Ui = (,4 ) 

and — TT < 9i < TT. The last, nonlocal, interaction is 



where |z — j| > 1 and 



Accordingly, the interaction is attractive if both residues are either hydropho- 
bic or polar (since Cij = 1 and 1/2, respectively), while it is repulsive if the 
residues belong to different species (cij = —1/2). The only difference with 
the model introduced by Stillinger et al. (Stillinger et al., 1993) comes from 
the nearest-neighbour interaction: the originally rigid bond is here replaced 
by the harmonic term Vi. We have preferred this latter choice, because it 
represents a more reaUstic nearest neighbours interaction. Anyhow, the large 
value of the coupUng constant a herein adopted makes the difference rather 
irrelevant. 

Quite accurate Monte-Carlo (MC) simulations, performed by employing 
innovative schemes, have revealed that, analogously to real proteins, only a 
few sequences fold into a native structure (good folders), while the majority 
of the possible sequences do not possess a unique folded state (Irback and 
Potthast, 1995; Irback et al., 1997). 

The dynamics of the toy model (1) has been investigated by integrat- 
ing the corresponding Hamilton- Jacobi equations in the presence of a heat 
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bath. The thermal reservoir has been simulated by separately implementing a 
Nose-Hoover thermostat for each residue of the chain, while the integration 
has been performed by employing a second order Runge-Kutta scheme. The 
evolution equations read 

~ Px,i ) ili ~ Py,i (6) 

(8) 

where Q represents the "bath" variable that acts to keep the temperature of 
the ith residue at the constant value T^, and r is the "reaction" time of the 
bath (typically set equal to 1 in our simulations). Numerical integrations have 
been performed with a time-step 5t = 0.025 , after having verified that this 
value is small enough to guarantee a good accuracy. 

Two different kinds of dynamical simulations have been performed, namely 
unfolding (US) and folding (FS) simulations. In the first case, the initial state 
of the "protein" is taken equal to the native configuration (NC), that we as- 
sume to coincide with the minimal energy configuration. Thermodynamic 
quantities have been thereby determined by averaging fixed-temperature sim- 
ulations over a time interval i ~ 5 • 10^. FS's have instead been performed 
starting from an initial configuration generated by setting the residues at a 
fixed distance j = ro with randomly distributed angles Oi within the inter- 
val [— 7r/4; 7r/4]. The system is then let relax for a time tr that has been fixed 
depending on the simulation temperature (from ~ 5 • 10^ to tr ~ 10^ in 
the temperature range considered later on). After this transient, the various 
observables have been averaged over a time interval ranging from 5 • 10^ to 
1.9-10^. Additionally, we have averaged over 10 different initial conditions. 

In order to investigate the folding properties of this toy model, we have 
studied a homopolymer of length 20 and 4 heteropolymers each composed of 
14 H-type and 6 P-type residues. To be more specific, we have analyzed the 
dynamical and thermodynamical properties of the following five sequences : 

- [SO] a homopolymer composed of hydrophobic residues (i.e., = — 1, z = 
1,...,L); 

- [S 1]=[HHHP HHHP HHHP PHHP PHHH] a sequence previously ana- 
lyzed in (Irback et al., 1997), where it was identified by the code number 
8 1 and recognized as a good folder for the StiUinger model (StilUnger et 
al., 1993); 

- [S2]=[HHHH PHHP HPHP HHHH PHPH] the sequence with the maxi- 
mal Z-score (Bowie et al., 1991; Mirny and Shakhnovich, 1996) within 



paper.tex; 1/02/2008; 15:57; p. 5 



6 A. Torcini, R. Livi, & A. Politi 

an ensemble of 6,900 sequences each composed of 14 H- and 6 P-type 
residues ^; 

- [S3]=[PHPH HHHH HHPH HHHHP HHPP] a sequence identified by 
the code number 50 in Ref. (Irback et al., 1997), where it was recognized 
as a bad folder; 

- [S4]=[PPPH HPHH HHHH HHHP HHPH] a randomly generated se- 
quence of 14 H- and 6 P-type residues. 



3. Equilibrium Properties 

3.1. Standard thermodynamic observables 

Before investigating the protein-like properties of the heteropolymer dynam- 
ics, we have investigated standard equilibrium-thermodynamics observables. 
Let us start defining the temperature as 

r=^^E^^^^, (9) 

where the Boltzmann constant has been set equal to one, while (•) denotes a 
time average along the trajectory in the phase space (notice that the thermal 
baths defined in the previous section induce a canonical-ensemble measure in 
the phase space). 

In all cases, at sufficiently large temperatures, the averages obtained from 
US's and FS's do coincide: this indicates that the time span of the simulations 
is long enough to guarantee a good equiUbration of the measure. At lower 
temperatures, the heteropolymer structure can be trapped in local minima of 

' The definition of tiie Z-score is 

Z = {Vnc - {V})/W 

where Vnc is the potential energy of the NC, (V) is the average potential energy of a suitable 
set of alternative configurations and W = ^/{V^) - (V)'^. In order to select such config- 
urations, we have first identified 467 distinct inherent minima (see Sec. 4 for the definition) 
of the homopolymer. These minima have been then considered as the initial condition for a 
gradient method to identify the closest local minimum for each sequence. For the sequence 
S2, Z = -5.70, while for SI, Z = -2.97 (notice that in both cases the NC does not belong 
to the set of alternative configurations over which the average has been performed). 

Moreover, for each of the five sequences studied in this paper, the Z-score has been evalu- 
ated also identifying the "alternative" configurations with the inherent minima as determined 
from simulations performed at a temperature T = 0.08. In this case, the values are Z = —4.50 
[SI], Z = -3.16 [S3], Z = -3.08 [S4], Z = -2.98 [S2] and Z = -2.20 [SO]. Accoridngly, 
the best folder seems to be S I, while the worst one is SO. 
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the potential, thus yielding different results for the finite-time US's and FS's . 
This is illustrated in Fig. 1, where we have reported U(T) for the sequence 
SI. Although the difference between US's and FS's depends on the time span 
used in the averages, the expected exponential growth of the time needed to 
visit ergodically the whole phase-space makes it sensible to introduce a rough 
definition of "glassy" temperature, Tq, as the temperature below which the 
relative difference between the values of the internal energy estimated with 
the two procedures is larger than 10%. 



15 
U 

10 



5 







-5 

0.1 0.2 0.3 ^ 0.4 

Figure 1. The total energy U (T) in equilibrium simulations for the sequence S 1 : the solid line 
corresponds to US's, while the symbols refer to FS's. The dashed lines correspond to (from 
left to right), Tq, Tf, and Te (notice that for this sequence, Te w Tg). 



The slope of U{T), i.e. the specific heat Cy, exhibits a clear peak at 
a temperature Tg « 0.1 . Although one cannot speak of phase-transitions 
in finite systems, this behaviour is definitely reminiscent of the ^-transition 
firstly studied in homopolymers (De Gennes, 1979), where a low-temperature 
phase, characterized by compact configurations, and a high-temperature phase, 
characterized by random-coil states, have been identified. In the context of 
protein-like chains, this translates into the so-called collapse transition (Ca- 
macho and Thirumalai, 1993; Klimov and Thirumalai, 1996), that is iden- 
tified as the temperature corresponding to the maximum of Cy. Moreover, 
this U (T) dependence is also peculiar of systems with attractive interactions, 
where the collapse transition occurs when such interactions become domi- 
nant over the other energy contributions (see, e.g., self-gravitating systems, 
atomic and molecular clusters) (Antoni and Ruffo, 1995; Torcini and Antoni, 
1999; Haberland, 1995). 
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In practice, the specific heat is better estimated by looking at the fluctua- 
tions of the internal energy, 

The numerical results obtained for the sequences S0-S4 are reported in Fig. 2. 
There we see that all curves start from Cy — 40 to exhibit a more or less 
broad peak. In fact, at sufficiently low-temperatures, any sequence is practi- 
cally indistinguishable from a (disordered) 2d solid, in which case the specific 
heat is equal to 2L. At high temperatures, Cy seems to converge to some 
lower value. In a generic chain with nearest-neighbour interactions in a plane 
we expect a T/2 contribution from each one of the kinetic degrees of freedom 
and only one T/2 contribution from the components of the potential energy, 
dominated by the longitudinal interaction. Altogether this implies that the 
specific heat should be Cy ^ SL/2 = 30. In practice, we find sfightly larger 
values, as shown in Fig. 2. The difference has to be attributed to the Lennard- 
Jones potentials that are not yet completely neghgible at temperatures close 
to 0.4 (the energy contribution of the angular term turns out to be fairly inde- 
pendent of T). In fact, the interparticle distance grows with the temperature 
and, accordingly, the LJ energy increases as well, converging to from below. 



70 
60 
50 
40 



0.05 0.1 0.15 T 0.2 

Figure 2. The specific heat Cv as a function of T. All data refers to US's. 

Moreover, we notice that S 1 and S2 exhibit both the largest collapse tem- 
peratures and the highest Cy -peaks. For the other three sequences Tg is 
smaller by approximately a factor of two, while the peak value of Cy is 
15 — 20% lower than those obtained for SI and S2. This suggests that max- 
imizing the zeta score is a good strategy at least for optimizing the collapse 
transition. This can be investigated more directly by looking at the gyration 
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radius 




where rem denotes the center of mass of the sequence and the average is 
performed over configurations generated by the dynamical evolution of the 
system. The indication for the collapse transition is usually associated with a 
rapid decrease of Rgy{T) close to a temperature T = Tq, where its derivative 
should exhibit a singularity. However, as already observed in (Irback et al., 
1997), the finiteness of the chains can at most yield a smooth decrease of 
Rgy{T), being the singularity intrinsic to thermodynamic hmit properties. In 
finite systems a generally accepted estimate of Tg is the temperature at which 
dRgy{T) l&T is maximal. 




« ♦ S4 

1.5 I ■ ' • ' • 

0.0 0.1 0.2 T 0.3 

Figure 3. Gyration radius as a function of the temperature for all S0-S4 sequences. The data 
refers to US's. 

With this definition, we obtain for essentially the same value of Tq for 
the sequences SI and S2, while a >> Tq for all the other 3 sequences. 
Obviously, the homopolymer turns out to be the most compact sequence at all 
temperatures, as all LJ-potentials are attractive. Coherently with the previous 
analysis, we can notice a similar behaviour for 51 and SI which again display 
a more pronounced transition-like behaviour. A further observation concerns 
the relatively larger gyration radius exhibited by S4 at low temperature: we 
can imagine that the frustration of its typical random structure prevents the 
formation of a more compact "native" configuration. 

Some of the indicators computed for the S0-S4 sequences are reported 
in Table I. The largest value for the temperature of the "glassy" transition is 
obtained for SI (namely Tq 2± 0.048). Within the framework of the random 
energy model applied to the protein-folding problem, it is commonly believed 
that a good folder is characterized by a large ratio p = Tf/Tq (Onuchic et 



paper.tex; 1/02/2008; 15:57; p. 9 



10 



A. Torcini, R. Livi, & A. Politi 



al., 1995) (where Tp is the folding temperature, defined in Subsec. 4.2). From 
the data reported in Table 1, SI (identified as a good folder in (Irback et al., 
1997)) should be definitely the worst one! It is not clear whether such an 
akward conclusion implies that p is not a meaningful indicator as hoped for, 
or whether the sequence S 1 is not a good folder as previously believed. 

Table I. For all S0-S4 sequences we report: the glassy 
temperature Tg ; the ratio p = Tp /Tg (the folding temper- 
ature Tp can be found in Table III); the collapse-transition 
temperature as estimated from the gyration radius (Te) and 
from the specific heat (Tg* ); the maximum value of Cv ■ 





SO 


SI 


S2 


S3 


S4 


Tg 


0.022 


0.048 


0.028 


0.038 


0.025 


P 


1.63 


1.36 


1.71 


~ 0.95 


1.84 


Te 


0.16 


0.11 


0.13 


0.13 


0.13 


Tg* 


0.05 


0.105 


0.12 


0.06 


0.06 


CviTe*) 


60 


68 


72 


57 


55.5 



3.2. Distance between confingurations 

In order to study protein-like features of equilibrium simulations, it is impor- 
tant to look at the shape of the heteropolymer and, in particular, to quantify 
differences between shapes. Given any two configurations a and /?, identified 
by the angle sequences 9\°'\ 9^^^ (2 < i < L — 1, see Eq. (4) ), we introduce 
the following angular distance 

d{a,P) :=min(^ X: 1^1") -^f) | ; (11) 



1= 



where the minimum is taken over reflections only, since this distance is in- 
variant with respect to translations and rotations of any single configuration. 
Tipically, we are interested in looking at the angular distance between any 
dynamical configuration of a sequence and its corresponding NC. For the sake 
of brevity, we indicate this distance with 6, omitting the explicit dependence 
on the generic dynamical configuration. In the following, we will show that 
5 provides essentially the same information as the structural overlap function 
(Camacho and Thirumalai, 1993) 
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where rij is the distance between the zth and jih residues of a given config- 
uration, is the corresponding distance in the NC, G(x) is the Heaviside 
function and e denotes a suitably chosen threshold. In order to compare this 
indicator with 6 we have adopted a slightly different, but practically equiva- 
lent, definition by replacing Q{x) in Eq. (3.2) with (1 -|- tanh(/«a;))/2. It can 
be readily verified that 1/k plays the role of s; in our calculations we have 
set K = 1. The average angular distance {6) is reported in Figs. (4-5) for all 
S0-S4 sequences with both FS's and US's. At sufficiently high temperatures 
(T > To) the results of FS's and US's concide, as the observables introduced 
in the previous section do. Typically, {S) increases with T and eventually 
saturates to a sequence-dependent asymptotic value 6a for T > 0.2 . The 
comparison with the behaviour of the average structural overlap (x) (reported 
in Figs. (4-5) for US 's) indicates that this quantity is practically equivalent to 
{6) , the only difference being an irrelvant scale factor 

The dependence of {6) on the temperature T, obtained with the US's, 
gives a first rough idea of the shape of the native valley. The slower growth 
observed for S 1 suggests that this sequence is characterized by a wider basin 
of attraction. Some information about the "accessibility" of the native valley 
can, instead, be extracted from the difference between (S) obtained with FS's 
and US's. If, during the folding dynamics, the heteropolymer is unable to 
enter the native valley in a broad temperature range (within the employed 
integration time) this is a strong indication that the corresponding sequence 
is a slow folder. 




Figure 4. The average distance {5) for US's (solid line) and FS's (circles) as a function of the 
temperature T, reported toghether with the average value of the structural overlap function 
(x) for US's (dashed line) in the homopolymer SO. 
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2 



1 



0.5 

s 

^ 

— 

1 



0.5 


0.2 0.4 0.2 0.4 

Figure 5. The same quantities as in the previous figure for the sequences S1-S4. 

In order to compare the different sequences on a more quantitative level, 
we have introduced the "foldabiUty" quality-factor 




where is the minimal value reached by {6) during the FS's. A high Q-value 
indicates that the protein noticeably approaches the native structure before 
the structural arrest sets in below T = Tq. Conversely, a relatively small Q- 
value suggests that the protein does not even enter the native valley before 
the dynamics is dramatically slowed down at the glassy transition. The data 
reported in Table II indicate that the largest Q- values are obtained for 5*1 and 
^2, indicating that the only two "good" folders are 51 and 52. 

Table II. The foldability factor Q, the temperatures 
T^, Tp, (7x, (Ts, and a* for the 5 considered se- 
quences. The values reported within parentheses for 
S2 correspond to a second peak shown by x for this 
sequence. 





SO 


SI 


S2 


S3 


S4 


Q 


1.37 


2.63 


2.44 


1.37 


1.92 


rpX 


0.05 


0.07 


0.05 (0.12) 


0.04 


0.05 


r|. 


0.04 


0.10 


0.07 


0.06 


0.07 




0.09 


0.33 


0.58 (0.00) 


0.33 


0.55 


as 


0.27 


0.05 


0.41 


0.00 


0.37 


a* 


0.78 


0.41 


0.63 


0.71 


0.64 



S1 

• 


S2 ^.^ — • • 






S3 , , 


— 1 \ ^ \ 

S4 




• ♦ — • • 
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In (Camacho and Thirumalai, 1993; Klimov and Thirumalai, 1996) it was 
suggested that the variance of x is a meaningful indicator to define the folding 
temperature. More precisely, it is claimed that, analogously to the collapse 
transition identified from the maximum of the fluctuations of the internal 
energy, the folding temperature corresponds to the peak of the variance of 
the structural overlap x- The almost equivalence between % and the angular 
distance 6 suggests that one could equally look at 

AS = {{6f) - {{6)f (12) 

where the average (•) is taken over all the possible configurations taken by 
a heteropolymer (with a given sequence) during a certain time evolution and 
over different initial conditions. Accordingly, the temperature corresponding 
to the peak of the variance AS should identify the folding temperature. As it is 
not a priori obivous that this second definition coincides with the former one, 
we shall denote it by Tp. The results of US's for both Tp and Tp are reported 
in Table 11 (those referring to Tp are confirmed by independent FS's). One 
can see that they are qualitatively similar, but not really close to each other. 
This should be taken as an indication that the concept of "folding transition" 
is ill-defined as all transitions in finite systems. 

The "Camacho-Klimov-Thirumalai criterion" states that when the param- 
eter 

_ \TS-T^\ 

(J V — 



is small (e.g., < 0.4 for off-lattice models), the corresponding sequence is 
a fast folder, while slow folders are characterized by > 0.6 (Veitshans 
et al., 1997; KUmov and Thirumalai, 1996). In our case we have estimated 
both o^ij") and (J5(T) (with an obvious meaning of the notation) and the 
corresponding values are reported in Table 11. As a first observation, notice 
that (y^(T) and cr5(r) do not give coherent indications. By looking at the 
values of we are lead to the "absurd" conclusion that the only good folder 
is the homopolymer and that all other sequences appear as not-so-fast folders. 
On the other hand, by looking at a^, one deduces that the fast folders are 
and S'i, a conclusion that is still partly inconsistent with (Irback et al., 1997), 
where it was ascertained that S3 is a bad folder. Thus, we can only conclude 
that in our case, the Camacho-Klimov-Thirumalai criterion does not help in 
properly identifying good folders. 

However, if we replace Tq with Tq (i.e. if we look at the peak of the fluctu- 
ations of the gyration radius) and define Tp as discussed in the next chapter 
(see Table III), we obtain the much more meaningful indicator a* . In fact, 
from the data reported in the last row of Table 11, one concludes that 51 is the 
only reasonably good folder. Clearly, such differences indicate that finite-size 
corrections are too important to be neglected in this type of studies. It would 



paper.tex; 1/02/2008; 15:57; p. 13 



14 



A. Torcini, R. Livi, & A. Politi 



be really useful to study a much larger ensemble of systems to conclude 
whether a* is a truly trustful indicator. In any case, it remains to be understood 
why cr* is more meaningful than other, apparently equivalent, indicators. 



4. Energy Landscape 
4.1. Native Configurations 

Assuming that the native configuration of each sequence coincides with the 
absolute minimum of the energy, we have determined it by first looking for 
the local minima "visited" by a sufficiently long trajectory at a fixed, not- 
too-low, temperature value (tipically, T = 0.08). More precisely, we act 
as follows: we sample the trajectory every At time units and consider the 
configuration (xi(nAt), yj(nAt), Xj = 0,yj = 0) as the initial condition for 
the overdamped dynamics 

Xi = , (13) 

1 OXi 

where 7 is an irrelevant damping constant fixing the time scale (an equivalent 
equation holds for y,). As a consequence, the phase point evolves towards 
the local minimum, the basin attraction of which contains the initial condi- 
tion. Such a local minimum is the so-called "inherent state" (Stillinger and 
Weber, 1984). Upon repating this procedure over and over, we can construct 
a large ensemble of inherent minima: the minimal-energy state is eventually 
identified with the NC. 

By comparing with (Irback, 2000), we have verified that, for the sequences 
SI and S3, this method allows for a correct identification not only of the NC 
but also of the 10 lowest energy configurations (with a few differences due 
to the contribution of the harmonic potential Vi, replaced by a rigid bond 
in the original model (Stillinger et al., 1993) ). Considering that our method 
is rather straightforward in comparison to the quite elaborate Monte Carlo 
techniques implemented in (Irback et al., 1997), this is a first indication of 
the advantage of the dynamical approach. To be more precise, simulations 
for a time t = 250, 000 (with sampUng-time At = 5 and minimization-time 
t = 125 - 7 = 2.5) typically allow for a correct identification of the NC and 
of the lowest energy minima with an accuracy of 10~^ — 10~^ in energy and 
10-2 for what concerns the angular distance S. 

The distribution of polar residues is such that the formation of a stable 
hydrofobic core is possible in SI and S2, while the concentration of the 
residues at the ends in S3 and S4 induces the formation of fairly unstable 
filaments. 

In Table m we report the potential energy Vnc of the native state to- 
gether with the energy gap Vgap separating the NC from the second lowest 
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Figure 6. The NCs corresponding to the five examined sequences. Full and open dots denote 
polar and hydrophobic residues, respectively. 

energy level. As expected the lowest energy corresponds to the homopolymer 
sequence, while the other native-state energies are quite close to each other. 
The largest gap is, instead, found for the sequence S 1 : it is more than 3 times 
larger than the V^^p-value for SO and S2 and more than 30 times larger than in 
S3 and S4. According to the "Shaknovich criterion" (Shakhnovic, 1994; Sali, 
Shakhnovic, and Karplus, 1994), a protein with a large energy gap between 
the NC and the first non-native (compact) configuration folds rapidly. There- 
fore, one expects that SI is a much faster folder, while S3 and S4 should 
really be slow folders. On the other hand, it has been recently shown that the 
folding dynamics depends on the whole energy landscape and not only on the 
energy gap (Pitard and Orland, 2000). In this sense, one should consider that 
such criteria may provide useful guidelines for an approximate identification 
of good folders. 

4.2. Folding Temperature Tp 

A commonly used definition of the folding temperature Tp (i.e. of the tem- 
perature below which the polypeptide chain is predominantly in the native 
configuration) states that Tp is the temperature for which the probability to 
visit the NC is 1/2. At finite temperatures, in off-lattice simulations, the NC 
is never exactly met: this implies that the implementation of the above def- 
inition requires defining a "visit" as a "close-encounter" with the ambiguity 
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Table HI. The minimal potential energy Vnc (corresponding to 
the NC) is here reported together with the energy gap Vgap, the 
number of nearest neighbours local minima of the NC, and the 
folding temperature Tp for the 5 considered sequences. 





SO 


SI 


S2 


S3 


S4 


Vnc 


-7.0422 


-4.6666 


-5.1234 


-4.6283 


-4.6661 


Vgap 


0.0255 


0.0922 


0.0244 


0.0025 


0.0017 


no 


6 


38 


33 


3 


28 


Tf 


0.036 


0.065 


0.048 


0.036 


0.046 



connected with the relative. One could simply state that the NC has been "vis- 
ited" whenever the phase point enters its basin of attraction. In practice, we 
have verified that this is too narrow a criterion to be utilized for a meaningful 
definition of Tp. Accordingly, we have preferred to compute the probability 
P{T) to visit the basin of attraction of either the NC or one of its neighbouring 
metastable states: in what follows we shall refer to this ensemble of attraction 
basins as "native valley". The definition of Tp is, therefore, imphcitely given 
by the constraint 

P{Tf) = 0.5 . 

The metastable states have been identified by following the sequence of min- 
ima visited during unfolding-dynamics simulations. In fact, if the temperature 
T is neither too small nor too high, a generic trajectory explores the native 
valley jumping among all minima around the NC. As a result, we have ob- 
served that the number no of minima surrounding the NC is maximal for the 
sequence SI, while it reduces dramatically for SO and S3 (see Table 111 for 
more details). This indicates that many different pathways can lead to the 
folded structure in the case of SI, S2 and S4, while a few paths exist for SO 
and S3. 

The probability P{T) to be in either the NC or one of its no neighbours 
has been measured at different temperatures for all the sequences. The data 
reported in Fig. 7 reveal quite an abrupt decrease of P{T) for both SO and 
S3, while a smoother behaviour has been found for the three other sequences. 
The corresponding folding temperatures are reported in Table III. The highest 
value is found for SI (Tp = 0.065), while the lowest for SO (Tp = 0.036). In 
the next section we will show that Tp coincides with the minimal temperature 
for which the NC is dynamically accessible within a "realistic" lapse of time. 
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Figure 7. The probability P(T) versus the temperature T in lin-log scales. The measurements 
have been performed during US's of duration t = 250, 000, where an overdamped relaxation 
scheme has been applied every At = 5 to find the underlying local minimum. 

4.3. Energy Funnel 

The energy landscape associated to three of the five mentioned sequences 
has been investigated by examining the distribution of local minima obtained 
by performing US's at various temperatures. For each value of T, we have 
made along simulation of duration tf = 12500, applying the above described 
overdamped integration scheme every At = 0.05 time units, to identify the 
inherent minima. As a result, we have been able to determine the number 
Nd{t, T) of distinct minima visited up to time t at temperature T. For each 
sequence, we have found that Nf = Nd{tf,T) decreases noticeably with T 
(see Fig. 8, where we have reported the results for the sequences SO, SI and 
S2). This observation, analogous to what recently found for a /?-heptapeptide 
in methanol solution (Daura et al., 1999), is quite obvious, since at low tem- 
peratures the trajectory remains trapped into local minima. In particular, we 
see that for the sequence SI, 6324 different minima have been identified at 
T = 0.09 while only 6 distinct minima have been visited at T = 0.04. 
A detailed investigation would require considering the hidden dependence 
of on the integration time tf. The practical computer- time limitations 
obliged us to study one case only. In Fig. 8 we observe that the sequence SI 
is characterized by an almost exponential increase of Nf with temperature. 
This confirms that the native valley is well connected to many other valleys 
of the energy landscape for the S 1 sequence, while much fewer connections 
are found for S2 and SO. 

A further interesting quantity to study is the temporal growth of the num- 
ber of distinct inherent minima visited. In Fig. 9, we have reported Nd{t, T) 
at various temperatures for the sequences SI and SO. It is evident that the 
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Figure 8. Number of distinct minima obtained during a US of duration t = 12, 500, with 
an overdamped relaxation scheme applied every At = 0.05 to locate the nearest inherent 
minima. 



heteropolymer, at low temperatures, is trapped in the native valley for the 
simulation duration, while for T > Tp, it starts visiting other valleys. 




Figure 9. Number of distinct inherent minima Nd{t,T) visited during a US of duration 
t = 12500. Panel (a) contains the results for the SI sequence: from bottom to top the tem- 
peratures are T = 0.04, 0.05, 0.06, 0.07, 0.08, and 0.09; panel (b) refers to the sequence SO: 
from bottom to top the temperatures are T = 0.04, 0.05, 0.06, 0.07, and 0.09. 

The nonuniform growth exhibited by (see, for instance, the highest 
temperature simulations for SI) are suggestive of the existence of different 
valleys: as soon as some specific "passes" are overcome, a new landscape 
appears making new valleys easily accessible. 



paper.tex; 1/02/2008; 15:57; p. 18 



A dynamical approach to protein folding 



19 



In order to give a pictorial description of the energy landscape, we have 
decided to define the "free energy" 

F{U) = -T\n[Q{U,T)] (14) 

where Q{U,T) is the probability that, at temperature T, the heteropolymer is 
in one of the inherent minima whose energy lies within the interval [U, U + 
SU]. The results for SI and SO are reported in Fig. 10, where Q{U,T) has 
been determined by fixing 5U = 0.1 . It is evident that at sufficiently low 
temperatures, the minimum of F{U) is located at the energy U = Uq of 
the NC, while at higher temperatures, the minimum shifts to larger values, 
indicating that the NC is no longer the favoured thermodynamical state. For 
the SI sequence, this occurs for T > 0.07, while for SO, the NC is no 
longer favoured already for T = 0.05. These numbers are compatibile with 
the previously estimated values of the folding temperature and suggest an 
alternative definition of Tp pointing more directly to the folding process as 
to a phase-transition (with all limitations due to the fact we are referring to 
finite systems). 




Figure 10. Free energy F{U) as a function ofU — Uo (where Uo is the energy of the NC) for 
various temperatures. Panel (a) refers to the SI sequence with temperatures T = 0.05, 0.06, 
0.07, 0.08, and 0.09 from top to bottom; panel (b) refers to the SO sequence with temperatures 
T = 0.05, 0.06, 0.07, and 0.09, from top to bottom. The origin of the free energy is fixed 
arbitrarily. 



5. Dynamics 

In the previous sections we have adopted the widespread attitude of describ- 
ing the folding process with concepts and tools borrowed from the language 
of equilibrium thermodynamics. This approach proves partially effective in 
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singling out differences between good and bad folders, although the heuris- 
tic criteria proposed so far reveal some degree of ambiguity. However, the 
folding process can be viewed as a transient evolution towards a uniquely se- 
lected native state. In this perspective, a dynamical description of the folding 
process seems more appropriate than a purely thermodynamic one. First, by 
looking at the evolution, one can identify the accessible pathways towards the 
native configuration and thereby estimate the folding time. Furthermore, the 
relative length of the proteins makes the use of thermodynamical concepts 
rather questionable. 

A proper observable to look at is the angular distance 6{t), defined in 
Sec. 3. In Fig. 11(a), one can notice how, for Tq < T < Tp, the approach 
to the native valley {5 0) of SI is characterized by a series of jumps 
that correspond to successive rearrangements of the chain configuration. This 
process can be better appreciated by looking at the snapshots taken in the 
various plateaus (the NC is reported for the sake of comparison). Notice that 
although the "asymptotic" average value 6a of the distance is not numerically 
too small {5a — 0.3), the dynamical configuration looks very similar to the 
NC (within a left-right symmetry transformation that in 2D has to be always 
considered). In other words 5a is quite sensitive an indicator. 

In Fig. 11(b) one can look at the evolution of the three components of the 
potential energy (see Eqs. (2)) for the same trajectory as in Fig. 11(a). The 
harmonic component Vi is, as expected, completely insensitive to the vari- 
ous structure changes, while the angular energy V2 Umits itself to displaying 
larger fluctuations in the asymptotic regime. The only interesting behaviour 
is exhibited by the long-range potential energy V3, which closely reproduces 
the jumps of 5{t) with the only exception of the first one. 
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Figure 11. A typical FS for the sequence SI at T = 0.06. (a) The distance 5 with respect to 
the NC is plotted versus time. The three upper insets contain snapshots of the configuration 
in each of the three plateaus (the NC is reported in the lower-left inset for comparison). The 

three component of the potential energy V2, Vi, and are reported in panel (b) (from top to 
bottom) for the same FS. The potential energies are arbitrarily shifted along the vertical axis. 
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The time needed for S 1 to enter the native valley is tipically on the order 
of 0(10^ — 10^) adimensional units. In physical units, this corresponds to 
0(10^^) seconds, a much shorter time than that typically observed in real 
proteins^. It seems reasonable to conjecture that the two-dimensional charac- 
ter of the space is the main responsible for the shortness of the time scale. 
Indeed, the more stringent geometrical constraints induce a faster folding 
in 2D than in 3D. Moreover, the relatively small chain length (L = 20) 
contributes to fastening the folding process, as well. 

Nonetheless, it is instructive to point out that the average time scale of 
the folding process is already six orders of magnitude larger than the typi- 
cal scales of equilibrium vibrations: this testifies to the meaninfulness of the 
model itself that is Ukely to be strenghtened by extending it to 3D. 

The behaviour of the other sequences exhibit both differences and analo- 
gies. In order to observe a convincing convergence to the NC, one has to 
consider smaller temperatures. From a biological point of view, the energy 
scale is very important, as a meaningful protein has to fold in a specific tem- 
perature range. On the other hand, as the energy scale is somehow arbitrary 
in our model, one would like to understand whether the difference between 
the various sequences reduces to fixing the temperatures at which specific 
phenomena are observed. 

The data reported in Figs. 12 and 13 refer to SO and S3, respectively. The 
evolution of S and J7 for T = 0.04 is qualitatively similar to that exhibited by 
SI at temperature T = 0.06. Nevertheless, there is an important difference 
that may have relevant biological implications: the fluctuations of 5 (and, 
similarly, of the total energy U) within what appears to be the native valley 
are larger for SO and S3 than for SI (the standard deviation is approximately 
equal to 0.14, 0.12, and 0.08, in the three cases), even though the temperature 
is comparably smaller than in the latter case. This points to a more clear 
configurational stability of the folded state of S 1 . 



6. Concluding remarks 

Five different sequences of a 2d toy-model of aminoacid chains have been 
studied in detail. Time averages of thermodynamical indicators have been 
analyzed in order to check their consistency in discriminating between one 

^ A rough estimate of the "real" time scale involved in the folding process described by our 
model can be derived from the period of small oscillations tlj ~ ■\/ ma^/e in the Lennard- 
Jones potential, where m is the typical mass of an aminoacid, e is the depth of the potential 
well and a is the equilibrium distance. In our model, m = 1 , ct ~ 1 and e — 1, while for an 
aminoacid m ~ 1 — 3 • 10^'^^ g., the equilibrium distance is 7 — 9 • 10~* cm. and the enregy 
of a hydrophobic interaction is ~ 1 — 2 • 10"^"^ erg. Therefore, tlj ~ 4 — 6 • 10~^^ sec, so 
that the folding time-scale for the sequence SI is C(10~^) sees. 
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Figure 12. A typical FS for the sequence SO at T = 0.04. The distance 5 with respect to the 
NC and the potential energy U are plotted versus time. The potential energy U is arbitrarly 
shifted along the vertical axis. 
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Figure 13. Four samples of FSs for the sequence S3 at T = 0.04. The distance S with respect 
to the NC (upper curves) and the overall potential energy U (lower curves) are plotted versus 
time. The potential energy U is arbitrarly shifted along the vertical axis. 



specimen of "good folder" (SI) and a set of four very different sequences. 
These equilibrium simulations yield controversial conclusions: the various 
indicators often attribute different rankings to the various sequences (in one 
case, SI is even located among the worst candidates for a "good folder"). 

A clear identification of S 1 as a "good folder" is provided by the foldabil- 
ity factor Q and by the Camacho, KUmov and Thirumalai criterion appUed to 
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a*. The Shaknovich criterion too points in this direction, as SI exhibits the 
largest energy gap Vgap. 

In summary, applying a "majority rule" we are led to conclude that SI 
is actually a "good folder". Nonetheless, it would be definitely better to rely 
upon less ambiguous criteria for achieving such a conclusion. The direct in- 
spection of the dynamics of the sequences provides a more clear scenario. 
Simulations performed at temperatures close to Tp, reveal that an initial 
"coil" state evolves towards its native configuration entering the energy fun- 
nel by a sequence of configurational jumps. In order to obtain a similar sce- 
nario for the other sequences, the temperature has to be significantly lowered 
and this is not sufficient, since the relative fluctuations within the native valley 
are comparably larger. 

The analysis of the structure of the energy landscape provides comple- 
mentary indications that are consistent with the dynamical description. In 
particular, the "free-energy" F{U) defined in Eq. (14) shows that the NC 
is a minimum of F{U) only below a temperature close to Tp (defined either 
by looking at the behaviour of 5 or x)- Above Tp, the minimum shifts away, 
suggesting that the stable thermodynamical state differs from NC. Moreover 
we have discovered that in 51 the ten ineherent minima of lowest energy 
are dynamically connected to the NC (i.e. all of them belong to the na- 
tive valley), while in SO and S3 only a few are directly connected to the 
corresponding NC's. This indicates that the gap height, although certainly 
important, does not provide a full account of the relevant folding properties. 
Preliminary investigations (Tiberio, 2000) suggest that such an information 
must be complemented with the "connectivity" between the NC and the other 
low-energy minima and with the height and "shape" of the barriers separating 
the NC from the inherent minima. 
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